IPL Data Analysis

Introduction

Sports analytics is now applied to virtually every sport around the world. It not only improves match prediction but also helps evaluate team and individual player performance, which teams can use to improve and win more often. In this analysis we look at cricket, one of the most popular sports, using data from the Indian Premier League (IPL).

Loading required packages for the project
# loading required packages
library(lubridate)
library(tidyverse)
library(gapminder)
library(ggplot2)
library(knitr)
library(skimr)
library(dplyr)
library(ggthemes)
library(data.table)
library(reshape)
library(insight)
library(stringr)
library(plotly)

Explaining the data set

  • Overview Dataset

  • Matches Dataset

  • Deliveries Dataset

  • Deliveries2 Dataset

Loading the data sets into R

# reading all the csv files from the data sets

overview <- read.csv("overview.csv",na.strings=c("","NA"))
matches <- read.csv("matches.csv",na.strings=c("","NA"))
deliveries <- read.csv("deliveries.csv",na.strings=c("","NA"))
deliveries2 <- read.csv("deliveries2.csv",na.strings=c("","NA"))

Our data concern cricket, and we want to explore the individual data sets, looking closely at their variables and at any missing values. We start with the matches data set, chosen mainly because it contains the key fields we need to combine to show which teams won the most games.
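A quick way to see where values are missing is to count NAs per column before any rows are dropped. The sketch below uses a small made-up data frame (toy_matches and its values are illustrative, not taken from the IPL files); the same two lines could be applied to matches.

```r
# Toy stand-in for the matches data; the values are illustrative only
toy_matches <- data.frame(
  city    = c("Hyderabad", NA, "Pune"),
  winner  = c("Sunrisers Hyderabad", "Mumbai Indians", NA),
  umpire3 = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Number and share of missing values in each column
na_count  <- colSums(is.na(toy_matches))
na_share  <- round(na_count / nrow(toy_matches), 2)
na_report <- data.frame(n_missing = na_count, share = na_share)
na_report
```

A column with a share close to 1 is a candidate for being dropped before rows are filtered.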

Exploring Important Variables in the data sets

Match Dataset

# matches data set

glimpse(matches)
## Rows: 756
## Columns: 18
## $ id              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ season          <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
## $ city            <chr> "Hyderabad", "Pune", "Rajkot", "Indore", "Bangalore", …
## $ date            <chr> "2017-04-05", "2017-04-06", "2017-04-07", "2017-04-08"…
## $ team1           <chr> "Sunrisers Hyderabad", "Mumbai Indians", "Gujarat Lion…
## $ team2           <chr> "Royal Challengers Bangalore", "Rising Pune Supergiant…
## $ toss_winner     <chr> "Royal Challengers Bangalore", "Rising Pune Supergiant…
## $ toss_decision   <chr> "field", "field", "field", "field", "bat", "field", "f…
## $ result          <chr> "normal", "normal", "normal", "normal", "normal", "nor…
## $ dl_applied      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ winner          <chr> "Sunrisers Hyderabad", "Rising Pune Supergiant", "Kolk…
## $ win_by_runs     <int> 35, 0, 0, 0, 15, 0, 0, 0, 97, 0, 0, 0, 0, 17, 51, 0, 2…
## $ win_by_wickets  <int> 0, 7, 10, 6, 0, 9, 4, 8, 0, 4, 8, 4, 7, 0, 0, 6, 0, 4,…
## $ player_of_match <chr> "Yuvraj Singh", "SPD Smith", "CA Lynn", "GJ Maxwell", …
## $ venue           <chr> "Rajiv Gandhi International Stadium, Uppal", "Maharash…
## $ umpire1         <chr> "AY Dandekar", "A Nand Kishore", "Nitin Menon", "AK Ch…
## $ umpire2         <chr> "NJ Llong", "S Ravi", "CK Nandan", "C Shamshuddin", NA…
## $ umpire3         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
skim(matches)
Data summary
Name matches
Number of rows 756
Number of columns 18
_______________________
Column type frequency:
character 13
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
city 7 0.99 4 14 0 32 0
date 0 1.00 8 10 0 546 0
team1 0 1.00 13 27 0 15 0
team2 0 1.00 13 27 0 15 0
toss_winner 0 1.00 13 27 0 15 0
toss_decision 0 1.00 3 5 0 2 0
result 0 1.00 3 9 0 3 0
winner 4 0.99 13 27 0 15 0
player_of_match 4 0.99 5 17 0 226 0
venue 0 1.00 8 52 0 41 0
umpire1 2 1.00 5 21 0 61 0
umpire2 2 1.00 5 21 0 65 0
umpire3 637 0.16 6 23 0 25 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 1792.18 3464.48 1 189.75 378.5 567.25 11415 ▇▁▁▁▁
season 0 1 2013.44 3.37 2008 2011.00 2013.0 2016.00 2019 ▇▆▆▅▇
dl_applied 0 1 0.03 0.16 0 0.00 0.0 0.00 1 ▇▁▁▁▁
win_by_runs 0 1 13.28 23.47 0 0.00 0.0 19.00 146 ▇▁▁▁▁
win_by_wickets 0 1 3.35 3.39 0 0.00 4.0 6.00 10 ▇▁▃▃▁

Delivery Dataset

# deliveries data set

glimpse(deliveries)
## Rows: 179,078
## Columns: 21
## $ match_id         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ inning           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ batting_team     <chr> "Sunrisers Hyderabad", "Sunrisers Hyderabad", "Sunris…
## $ bowling_team     <chr> "Royal Challengers Bangalore", "Royal Challengers Ban…
## $ over             <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,…
## $ ball             <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4,…
## $ batsman          <chr> "DA Warner", "DA Warner", "DA Warner", "DA Warner", "…
## $ non_striker      <chr> "S Dhawan", "S Dhawan", "S Dhawan", "S Dhawan", "S Dh…
## $ bowler           <chr> "TS Mills", "TS Mills", "TS Mills", "TS Mills", "TS M…
## $ is_super_over    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ wide_runs        <int> 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bye_runs         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ legbye_runs      <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ noball_runs      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ penalty_runs     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ batsman_runs     <int> 0, 0, 4, 0, 0, 0, 0, 1, 4, 0, 6, 0, 0, 4, 1, 0, 0, 3,…
## $ extra_runs       <int> 0, 0, 0, 0, 2, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_runs       <int> 0, 0, 4, 0, 2, 0, 1, 1, 4, 1, 6, 0, 0, 4, 1, 0, 0, 3,…
## $ player_dismissed <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "DA Warne…
## $ dismissal_kind   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "caught",…
## $ fielder          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Mandeep …
skim(deliveries)
Data summary
Name deliveries
Number of rows 179078
Number of columns 21
_______________________
Column type frequency:
character 8
numeric 13
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
batting_team 0 1.00 13 27 0 15 0
bowling_team 0 1.00 13 27 0 15 0
batsman 0 1.00 5 20 0 516 0
non_striker 0 1.00 5 20 0 511 0
bowler 0 1.00 5 17 0 405 0
player_dismissed 170244 0.05 5 20 0 487 0
dismissal_kind 170244 0.05 3 21 0 9 0
fielder 172630 0.04 5 21 0 499 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
match_id 0 1 1802.25 3472.32 1 190 379 567 11415 ▇▁▁▁▁
inning 0 1 1.48 0.50 1 1 1 2 5 ▇▇▁▁▁
over 0 1 10.16 5.68 1 5 10 15 20 ▇▇▇▇▇
ball 0 1 3.62 1.81 1 2 4 5 9 ▇▇▃▅▁
is_super_over 0 1 0.00 0.02 0 0 0 0 1 ▇▁▁▁▁
wide_runs 0 1 0.04 0.25 0 0 0 0 5 ▇▁▁▁▁
bye_runs 0 1 0.00 0.12 0 0 0 0 4 ▇▁▁▁▁
legbye_runs 0 1 0.02 0.19 0 0 0 0 5 ▇▁▁▁▁
noball_runs 0 1 0.00 0.07 0 0 0 0 5 ▇▁▁▁▁
penalty_runs 0 1 0.00 0.02 0 0 0 0 5 ▇▁▁▁▁
batsman_runs 0 1 1.25 1.61 0 0 1 1 7 ▇▁▁▁▁
extra_runs 0 1 0.07 0.34 0 0 0 0 7 ▇▁▁▁▁
total_runs 0 1 1.31 1.61 0 0 1 1 10 ▇▁▁▁▁

Missing values, Lubridate, Stringr & Summary Statistics

Dealing with missing values

Based on the reports above, the matches data set contains missing values. Most columns have very few (at most 7 of 756 rows), but umpire3 is missing in 637 rows. We chose to omit missing values with na.omit(), which drops every row that has an NA in any column. This leaves the 118 matches from the 2018 and 2019 seasons shown below, so results based on matches_o describe only those seasons. The new data set is created as follows.

# omit missing values for matches dataset
matches_o <- na.omit(matches) 

After removing the NA values, we again use the glimpse and skim functions to inspect the new data set.

glimpse(matches_o)
## Rows: 118
## Columns: 18
## $ id              <int> 7894, 7895, 7896, 7897, 7898, 7899, 7900, 7901, 7902, …
## $ season          <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
## $ city            <chr> "Mumbai", "Mohali", "Kolkata", "Hyderabad", "Chennai",…
## $ date            <chr> "07/04/18", "08/04/18", "08/04/18", "09/04/18", "10/04…
## $ team1           <chr> "Mumbai Indians", "Delhi Daredevils", "Royal Challenge…
## $ team2           <chr> "Chennai Super Kings", "Kings XI Punjab", "Kolkata Kni…
## $ toss_winner     <chr> "Chennai Super Kings", "Kings XI Punjab", "Kolkata Kni…
## $ toss_decision   <chr> "field", "field", "field", "field", "field", "field", …
## $ result          <chr> "normal", "normal", "normal", "normal", "normal", "nor…
## $ dl_applied      <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ winner          <chr> "Chennai Super Kings", "Kings XI Punjab", "Kolkata Kni…
## $ win_by_runs     <int> 0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 19, 4, 71, 46, 0, 15, 6…
## $ win_by_wickets  <int> 1, 6, 4, 9, 5, 0, 1, 4, 7, 5, 0, 0, 0, 0, 7, 0, 0, 9, …
## $ player_of_match <chr> "DJ Bravo", "KL Rahul", "SP Narine", "S Dhawan", "SW B…
## $ venue           <chr> "Wankhede Stadium", "Punjab Cricket Association IS Bin…
## $ umpire1         <chr> "Chris Gaffaney", "Rod Tucker", "C Shamshuddin", "Nige…
## $ umpire2         <chr> "A Nanda Kishore", "K Ananthapadmanabhan", "A.D Deshmu…
## $ umpire3         <chr> "Anil Chaudhary", "Nitin Menon", "S Ravi", "O Nandan",…
skim(matches_o)
Data summary
Name matches_o
Number of rows 118
Number of columns 18
_______________________
Column type frequency:
character 13
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
city 0 1 4 13 0 11 0
date 0 1 8 8 0 94 0
team1 0 1 14 27 0 9 0
team2 0 1 14 27 0 9 0
toss_winner 0 1 14 27 0 9 0
toss_decision 0 1 3 5 0 2 0
result 0 1 3 6 0 2 0
winner 0 1 14 27 0 9 0
player_of_match 0 1 6 15 0 61 0
venue 0 1 12 52 0 16 0
umpire1 0 1 6 21 0 21 0
umpire2 0 1 6 21 0 23 0
umpire3 0 1 6 23 0 25 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 9572.61 1685.65 7894 7923.25 7952.5 11319.75 11415 ▇▁▁▁▇
season 0 1 2018.49 0.50 2018 2018.00 2018.0 2019.00 2019 ▇▁▁▁▇
dl_applied 0 1 0.03 0.16 0 0.00 0.0 0.00 1 ▇▁▁▁▁
win_by_runs 0 1 11.36 21.09 0 0.00 0.0 14.00 118 ▇▁▁▁▁
win_by_wickets 0 1 3.27 3.23 0 0.00 4.0 6.00 10 ▇▂▅▂▁
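Because na.omit() drops a row when any column is NA, a mostly empty column such as umpire3 (637 missing values in the skim report for matches) can remove most of the data. A minimal sketch, on made-up data, of restricting the completeness check to the columns an analysis actually needs; the choice of columns in needed is a hypothetical example.

```r
# Toy data: umpire3 is mostly missing, the other columns rarely are
toy <- data.frame(
  winner      = c("A", "B", NA, "C"),
  win_by_runs = c(10, 0, 5, 7),
  umpire3     = c(NA, NA, NA, "X"),
  stringsAsFactors = FALSE
)

# na.omit() keeps only rows that are complete in every column
all_cols <- na.omit(toy)

# complete.cases() on a subset keeps rows complete in the chosen columns
needed    <- c("winner", "win_by_runs")   # hypothetical choice of columns
some_cols <- toy[complete.cases(toy[, needed]), ]

nrow(all_cols)
nrow(some_cols)
```

The second approach keeps more rows at the cost of leaving NAs in columns that are not used.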

Creating a new column for the day of the week with lubridate

# Use wday() from the lubridate library to create a new column holding the day of the week on which each match was played

matches$Day <- wday(as_date(matches$date))
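The glimpse output for matches shows ISO dates such as "2017-04-05", while the output for matches_o shows dates such as "07/04/18", so the date column appears to mix two formats. The sketch below assumes that only these two formats occur and that the second is day/month/year; parse_match_date is a hypothetical helper, written in base R so that each format is handled explicitly.

```r
# Hypothetical helper: parse a character vector that mixes
# "YYYY-MM-DD" and "DD/MM/YY" dates (both formats are assumptions)
parse_match_date <- function(x) {
  iso <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  out <- rep(as.Date(NA), length(x))
  out[iso]  <- as.Date(x[iso],  format = "%Y-%m-%d")
  out[!iso] <- as.Date(x[!iso], format = "%d/%m/%y")
  out
}

dates <- parse_match_date(c("2017-04-05", "07/04/18"))

# Day of week numbered 1 = Sunday ... 7 = Saturday,
# the same convention as lubridate::wday() with its default settings
day_num <- as.POSIXlt(dates)$wday + 1
day_num
```

Values that match neither format come back as NA, which makes parsing failures easy to spot.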

Counting player-of-the-match awards with stringr

# Use str_count to count how many times a given player was named player of the match

p_o_m <- str_count(matches_o$player_of_match, "S Dhawan")

p_o_m_count <- sum(p_o_m)

p_o_m_count
## [1] 4

We wanted to check how many times the player “S Dhawan” won player of the match. In matches_o, he won it 4 times.
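Pattern counting matches substrings, so a longer name that contains the pattern is counted as well. The sketch below uses made-up values ("RS Dhawan" is a hypothetical name chosen only to illustrate the overlap) and base R's grepl, which behaves like a substring search here, to contrast that with an exact comparison.

```r
# Illustrative vector; "RS Dhawan" is a hypothetical name for the example
pom <- c("S Dhawan", "RS Dhawan", "S Dhawan", NA, "V Kohli")

# Substring search: also counts names that merely contain "S Dhawan"
substring_hits <- sum(grepl("S Dhawan", pom, fixed = TRUE))

# Exact comparison: counts only the full name, ignoring missing values
exact_hits <- sum(pom == "S Dhawan", na.rm = TRUE)

c(substring = substring_hits, exact = exact_hits)
```

When player names can overlap, the exact comparison is the safer count.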

Summary statistics for two quantitative variables

Statistics for win_by_runs grouped by city
# summary statistics for the win_by_runs in each city.

data.table(matches_o)[, as.list(summary(win_by_runs)), by="city"]
##              city  Min. 1st Qu. Median      Mean 3rd Qu.  Max.
##            <char> <num>   <num>  <num>     <num>   <num> <num>
##  1:        Mumbai     0       0      0 10.437500   17.50    46
##  2:        Mohali     0       0      0  4.500000   10.00    15
##  3:       Kolkata     0       0      0 17.750000   25.75   102
##  4:     Hyderabad     0       0      1 17.666667   26.00   118
##  5:       Chennai     0       0      0 17.333333   22.00    80
##  6:        Jaipur     0       0      0  5.714286   10.75    30
##  7:     Bengaluru     0       0      0  5.461538   14.00    19
##  8:          Pune     0       0      0 12.833333    9.75    64
##  9:         Delhi     0       0      2 11.714286   14.75    55
## 10:        Indore     0       0      0  7.750000    7.75    31
## 11: Visakhapatnam     0       0      0  0.000000    0.00     0
Statistics for win_by_wickets grouped by team1
#summary statistics for the win_by_wickets for each team.

data.table(matches_o)[, as.list(summary(win_by_wickets)), by="team1"]
##                          team1  Min. 1st Qu. Median     Mean 3rd Qu.  Max.
##                         <char> <num>   <num>  <num>    <num>   <num> <num>
## 1:              Mumbai Indians     0       0    0.0 1.894737    3.50     8
## 2:            Delhi Daredevils     0       0    5.0 3.666667    6.00     9
## 3: Royal Challengers Bangalore     0       0    4.5 3.500000    5.75     7
## 4:            Rajasthan Royals     0       0    5.0 4.307692    6.00     9
## 5:       Kolkata Knight Riders     0       0    5.0 3.933333    7.00     9
## 6:             Kings XI Punjab     0       0    3.5 3.428571    5.75    10
## 7:         Chennai Super Kings     0       0    2.0 3.000000    6.00     8
## 8:         Sunrisers Hyderabad     0       0    3.0 3.250000    6.00     8
## 9:              Delhi Capitals     0       0    2.5 2.833333    5.75     6

Frequency table for two categorical variables

# Generating the frequency table using table function

freq_table <- table(matches_o$winner, matches_o$toss_decision)
freq_table
##                              
##                               bat field
##   Chennai Super Kings           2    19
##   Delhi Capitals                2     7
##   Delhi Daredevils              1     4
##   Kings XI Punjab               1    11
##   Kolkata Knight Riders         1    14
##   Mumbai Indians                4    13
##   Rajasthan Royals              4     8
##   Royal Challengers Bangalore   0    11
##   Sunrisers Hyderabad           5    11

We split the column “toss_decision” into separate columns to see, for each match, which decision the toss winner made and by how many runs the match was won.

By using pivot_wider

#created a pivot-wider for toss_decision

decision_wider <- matches_o %>%
  pivot_wider(id_cols= id:toss_winner,
              names_from = toss_decision, 
              values_from = win_by_runs, 
              values_fill = 0)
decision_wider
## # A tibble: 118 × 9
##       id season city      date     team1           team2 toss_winner field   bat
##    <int>  <int> <chr>     <chr>    <chr>           <chr> <chr>       <int> <int>
##  1  7894   2018 Mumbai    07/04/18 Mumbai Indians  Chen… Chennai Su…     0     0
##  2  7895   2018 Mohali    08/04/18 Delhi Daredevi… King… Kings XI P…     0     0
##  3  7896   2018 Kolkata   08/04/18 Royal Challeng… Kolk… Kolkata Kn…     0     0
##  4  7897   2018 Hyderabad 09/04/18 Rajasthan Roya… Sunr… Sunrisers …     0     0
##  5  7898   2018 Chennai   10/04/18 Kolkata Knight… Chen… Chennai Su…     0     0
##  6  7899   2018 Jaipur    11/04/18 Rajasthan Roya… Delh… Delhi Dare…    10     0
##  7  7900   2018 Hyderabad 12/04/18 Mumbai Indians  Sunr… Sunrisers …     0     0
##  8  7901   2018 Bengaluru 13/04/18 Kings XI Punjab Roya… Royal Chal…     0     0
##  9  7902   2018 Mumbai    14/04/18 Mumbai Indians  Delh… Delhi Dare…     0     0
## 10  7903   2018 Kolkata   14/04/18 Kolkata Knight… Sunr… Sunrisers …     0     0
## # ℹ 108 more rows
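How pivot_wider spreads values can be easier to follow on a few rows. A minimal sketch on made-up data, assuming the tidyr package is installed: each distinct toss_decision becomes a column, the win_by_runs value lands in the column for that row's decision, and values_fill supplies 0 for the other column.

```r
library(tidyr)

# Three illustrative matches
toy <- data.frame(
  id            = 1:3,
  toss_decision = c("field", "bat", "field"),
  win_by_runs   = c(10, 7, 25),
  stringsAsFactors = FALSE
)

wide <- pivot_wider(toy,
                    id_cols     = id,
                    names_from  = toss_decision,
                    values_from = win_by_runs,
                    values_fill = 0)
wide
```

A 0 in the wide table can therefore mean either a filled-in gap or a genuine margin of 0 runs, which is worth keeping in mind when reading the output.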

Data Dictionary

Create a data dictionary showcasing the variables used in your analyses

# create a new data with only one row for data dictionary

mat <- head(matches, 1)
del <- head(deliveries, 1)


# merging two different data into one dataset

Match_Del <- merge(mat,del)

# Extracting only required columns used in the analysis

Match_Del_new <- subset(Match_Del, select=c("win_by_runs", "city", "win_by_wickets", "team1", "team2","winner","toss_decision","toss_winner", "batsman_runs"))

# Creating dictionary table for used variables

dataDictionary <- tibble(Variable = colnames(Match_Del_new),
                         Description = c("Margin of victory in runs",
                                         "City where the match was held",
                                         "Margin of victory in wickets",
                                         "First team listed for the match",
                                         "Second team listed for the match",
                                         "Name of the winning team",
                                         "Toss winner's decision: bat or field",
                                         "Name of the team that won the toss",
                                         "Runs scored by the batsman on a delivery"),
                         Type = map_chr(Match_Del_new, .f = function(x){typeof(x)[1]}))

knitr::kable(dataDictionary)
Variable        Description                                Type
win_by_runs     Margin of victory in runs                  integer
city            City where the match was held              character
win_by_wickets  Margin of victory in wickets               integer
team1           First team listed for the match            character
team2           Second team listed for the match           character
winner          Name of the winning team                   character
toss_decision   Toss winner's decision: bat or field       character
toss_winner     Name of the team that won the toss         character
batsman_runs    Runs scored by the batsman on a delivery   integer

Data Visualizations

Top 10 players with highest number of runs

Bar Chart
We created an interactive bar chart using plotly.

For any game, and for cricket in particular, it is useful to know the total number of runs scored by each batsman. This lets us compare batsmen by their run totals. To do that, we use the deliveries data set.

# Creating new variable using the deliveries data set

Top_Batsman<- deliveries %>% 
  group_by(batsman)%>%
  summarise(runs=sum(batsman_runs)) %>% 
  arrange((runs)) %>%
  filter(runs > 3000)

# Creating new variable for top_10 batsman  

Top_10_Batsman <- Top_Batsman %>% 
  top_n(n=10,wt=runs) %>%
  ggplot(aes(reorder(batsman, -runs),runs,fill=batsman)) +
  labs(title = "Top 10 Batsman with highest number of runs in IPL",
       x= "Batsman",
       y= "Runs",
       caption = "Data source: IPL Dataset from Kaggle")+
  scale_fill_viridis_d()+
  geom_bar(stat = "identity")+
  geom_text(aes(label = runs), 
            vjust = 0.5, size= 3) +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, vjust = 0.4),
        legend.position = "none")

ggplotly(Top_10_Batsman)

From the above plot, we can see that “V Kohli” has the highest number of runs, 5434.
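The ranking behind the chart comes down to summing runs per batsman and keeping the largest totals. A minimal base R sketch on made-up deliveries (the batsman names and run values are illustrative):

```r
# Illustrative ball-by-ball rows
toy_deliveries <- data.frame(
  batsman      = c("A", "A", "B", "B", "B", "C"),
  batsman_runs = c(4, 6, 1, 1, 2, 0),
  stringsAsFactors = FALSE
)

# Total runs per batsman
runs <- aggregate(batsman_runs ~ batsman, data = toy_deliveries, FUN = sum)

# Sort from highest to lowest total and keep the top 2
top2 <- head(runs[order(-runs$batsman_runs), ], 2)
top2
```

With the real data, the 2 in head() would be 10 and the input would be deliveries.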

Total number of matches in each city

Line Chart
# Creating new dataframe for the line chart

matches_cities <- matches %>% select(id:winner)%>%
  group_by(city) %>%
  summarise(Total= n())

# Generating new plot using the above data frame

Different_cities<- matches_cities %>% 
  filter(!is.na(city)) %>%
  ggplot()+
  aes(x= city, y = Total, group= 1)+
  geom_line(color = "#00abff")+
  labs(title = "Number of Matches played in different cities",
       x= "City",
       y= "Total Matches",
       caption = "Data source: IPL Dataset from Kaggle")+
  scale_color_continuous()+
  geom_text(aes(label = Total), 
            vjust = -0.125) +
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

Different_cities

The line chart above shows that the most matches were held in Mumbai (101) and the fewest in Bloemfontein (2).

Teams which have won the highest number of toss

Pie-Chart
# creating a new variable matches_p to get the highest number of toss winner w.r.t to teams 

matches_p <- matches %>% 
  group_by(toss_winner)%>%
  summarise(Percentage= n())

matches_p$Percentage <- round(matches_p$Percentage/sum(matches_p$Percentage)*100, digits = 1)

matches_p_new <- matches_p %>% 
  top_n(n=10, wt= Percentage)
  
# Generating a pie chart of the toss-win percentage for each team

matches_p_new %>% ggplot()+
  aes(x = "", y = -Percentage,fill = reorder(toss_winner, -Percentage)) + 
  geom_bar(stat = "identity", width= 1, color = "black") + 
  labs(title = "Team with highest toss winning (%)",
       caption = "Data source: IPL Dataset from Kaggle",
       fill ="Winning Teams") +
  coord_polar("y", start = 0) +
  theme_void()+
  geom_text(aes(label = Percentage), position = position_stack(vjust = 0.5),
            color = "black", size=2.9)+
  scale_color_viridis_d()

The pie chart shows that “Mumbai Indians” won the toss more often than any other team, while “Pune” has the lowest share among the teams shown.

Merge at least two tables, and create a plot or table of summary statistics that is a result of the merged data set

Stacked Bar-chart with Line Graph
# creating two tables from the matches data set

matches_won<-as.data.frame(table(matches$winner))
matches_played<-as.data.frame(table(matches$team2) + table(matches$team1))

# Re-writing the column names for the above data sets

colnames(matches_won) <- c('Team','Won')
colnames(matches_played) <- c('Team','Played')

# merging above two data sets with the function merge

matches_w_p <- merge(matches_won, matches_played)

matches_per <- matches_w_p %>%
  group_by(Team, Won, Played)%>%
  summarise(Win_Percent = round((Won/Played)*100, digits = 0))

matches_per_new <- as.data.frame(matches_per)

# Generating new plot with the merged data set using pivot_longer

# Stacked Bar chart with line graph on top

Stacked_Bar_Line <- matches_per_new %>% pivot_longer(Won:Played)%>% 
  ggplot(aes(x = Team)) +
  geom_bar(stat = "identity", aes(y = value,fill = name))+
  geom_line(aes(y = 3*Win_Percent), size = 0.5, color="red", group = 1)+
  geom_text(position=position_stack(vjust = .5),
            aes(x = Team, y = value, label = value), size= 3)+
  scale_y_continuous(
    name = "Won & Played",
    breaks = seq (0, 300, 50),
    # the line is drawn at 3 * Win_Percent, so dividing by 3 recovers the percentage
    sec.axis = sec_axis(~ . / 3, name="Win Percentage %", breaks = seq (0, 100, 10)))+
  labs(title = "Total number of Matches Played vs Won by each team",
       x= "Teams",
       y= "Count",
       fill= "",
       caption = "Data source: IPL Dataset from Kaggle")+
  scale_fill_manual(values = c("grey47", "grey"))+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.50, size = 5.2), 
        legend.position = "right")


Stacked_Bar_Line

According to the stacked bar chart above, the Mumbai Indians have played the most matches and won the most games overall.
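The merge step can be sketched on a few made-up matches. In this version, appearances are counted by stacking team1 and team2 into one vector, which avoids relying on both tables listing the same teams in the same order, and all.x = TRUE keeps teams that never won. The team names and results are illustrative.

```r
# Four illustrative matches
matches_toy <- data.frame(
  team1  = c("A", "B", "A", "C"),
  team2  = c("B", "C", "C", "A"),
  winner = c("A", "C", "A", "A"),
  stringsAsFactors = FALSE
)

# Matches played: every appearance as team1 or team2
played <- as.data.frame(table(c(matches_toy$team1, matches_toy$team2)),
                        stringsAsFactors = FALSE)
names(played) <- c("Team", "Played")

# Matches won
won <- as.data.frame(table(matches_toy$winner), stringsAsFactors = FALSE)
names(won) <- c("Team", "Won")

# Left join so that teams without a win are kept with Won = 0
summary_tbl <- merge(played, won, by = "Team", all.x = TRUE)
summary_tbl$Won[is.na(summary_tbl$Won)] <- 0
summary_tbl$Win_Percent <- round(100 * summary_tbl$Won / summary_tbl$Played)
summary_tbl
```

When a percentage is drawn on a secondary axis, the axis transformation should be the inverse of whatever factor was applied to the plotted values.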

Count for the number of 50s and 100s in IPL

Histogram
# creating histogram to check the 50s and 100s in IPL
Hist_graph <- hist(matches$win_by_runs,
                   main="Maximum number of 50s and 100s in IPL game",
                   xlab="Number of runs in IPL",
                   ylab= "Frequency of runs",
                   col = "darkslategray1")

text(Hist_graph$mids,Hist_graph$counts,labels=Hist_graph$counts, adj=c(0.5, -0.5))

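As cricket fans, we were curious about the cumulative number of 50s and 100s across the IPL seasons. The histogram above, however, plots win_by_runs, which is the margin of victory in runs for each match rather than an individual batsman's score, so it does not count 50s and 100s directly. It shows that most matches fall in the 0 to 10 bin; this bin includes every match won by wickets, where win_by_runs is 0. Counting 50s and 100s would require totaling batsman_runs per batsman per match from the deliveries data set.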
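Counting 50s and 100s requires per-batsman totals within each match, which the deliveries data set can provide through batsman_runs. A minimal sketch on made-up data, ignoring super overs and treating a "50" as a score from 50 to 99 (both are assumptions of this example):

```r
# Illustrative deliveries: each row is one ball
reps <- c(10, 5, 17, 13)
toy_deliveries <- data.frame(
  match_id     = rep(c(1, 1, 2, 2), times = reps),
  batsman      = rep(c("A", "B", "A", "B"), times = reps),
  batsman_runs = rep(c(6, 4, 6, 4), times = reps),
  stringsAsFactors = FALSE
)

# Runs scored by each batsman in each match
innings <- aggregate(batsman_runs ~ match_id + batsman,
                     data = toy_deliveries, FUN = sum)

# Scores of 50 to 99 and scores of 100 or more
fifties  <- sum(innings$batsman_runs >= 50 & innings$batsman_runs < 100)
hundreds <- sum(innings$batsman_runs >= 100)

c(fifties = fifties, hundreds = hundreds)
```

A histogram of innings$batsman_runs would then show the distribution of individual scores rather than winning margins.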

BootStrap and Monte Carlo Simulation

Implement at least one permutation test based on a traditional hypothesis test, such as a two-sample t-test or a chi-squared test of independence, to test a hypothesis of interest for your data

# mean win_by_runs in the matches won by each of the two teams

x <- mean(matches_o$win_by_runs[matches_o$winner=="Sunrisers Hyderabad"])
y <- mean(matches_o$win_by_runs[matches_o$winner=="Chennai Super Kings"])

# observed test statistic: absolute difference between the two means

t_sam <- abs(x - y)

# observations of sample
n <- length(matches_o$winner)

# number of permutations 
p <- 100

variable <- matches_o$win_by_runs

# Permutation Samples 

PermSamp <- matrix(0, nrow= n, ncol = p)

# Loop to fill each column with one random permutation of win_by_runs

for (i in 1:p) {
  PermSamp[,i] <- sample(variable, size=n, replace= FALSE)
  }

Perm_t_sam <- rep(0,p)

# loop to calculate the test statistic (absolute difference in means) for each permuted sample

for (i in 1:p) {
  Perm_t_sam[i] <- abs(mean(PermSamp[matches_o$winner=="Sunrisers Hyderabad",i])-
                              mean(PermSamp[matches_o$winner=="Chennai Super Kings",i]))
  }

# Percentage of permuted test statistics at least as large as the observed one (first 15 permutations only)

Hypothesis_value <- mean((Perm_t_sam >= t_sam)[1:15])*100

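Our null hypothesis is that the mean win_by_runs does not differ between matches won by Sunrisers Hyderabad and matches won by Chennai Super Kings. If it were false, we would expect very few permuted test statistics to be as large as the observed one. Among the first 15 permutations, 33.3% of the permuted values were at least as large as the observed value. This proportion is far above conventional significance levels such as 5%, so we fail to reject the null hypothesis: the data do not show a difference in mean winning margin between the two teams.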
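The same permutation scheme can be wrapped in a function that uses every permutation and a fixed seed, so the estimated p-value is reproducible. This is a sketch under stated assumptions rather than the code used for the result above: perm_test_mean_diff, its arguments, and the toy data are all hypothetical.

```r
# Hypothetical helper: permutation test for the absolute difference in
# group means, shuffling the values across all observations
perm_test_mean_diff <- function(values, groups, g1, g2,
                                n_perm = 1000, seed = 1) {
  set.seed(seed)
  stat <- function(v) abs(mean(v[groups == g1]) - mean(v[groups == g2]))
  observed <- stat(values)
  permuted <- replicate(n_perm, stat(sample(values)))
  # Share of permuted statistics at least as large as the observed one
  list(observed = observed, p_value = mean(permuted >= observed))
}

# Toy data with two clearly separated groups
values <- c(1, 2, 3, 4, 101, 102, 103, 104)
groups <- rep(c("X", "Y"), each = 4)

result <- perm_test_mean_diff(values, groups, "X", "Y")
result$observed
result$p_value
```

A small p_value indicates that a difference as large as the observed one rarely arises when the group labels carry no information.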

Obtain a parametric and nonparametric bootstrap-estimated standard error for at least one statistic of interest

Nonparametric bootstrap: estimated standard error of the sample median

# Setting the seed for reproducibility and the size of each bootstrap sample
set.seed(1989)
n<- 30

# Observed values: win_by_wickets from the matches data set

observed <- matches$win_by_wickets

# Sample median
median(observed)
## [1] 4
# Number of bootstrap samples
B<-10000

# Instantiating matrix for bootstrap samples
boots <- matrix (NA, nrow=n, ncol=B)

# Sampling with replacement B times
# (sample(1:n) draws row indices from the first n = 30 observations only)
for(b in 1:B) {
  boots[, b] <- observed[sample(1:n, size= n, replace = TRUE)]
}

#Instantiating vector for bootstrap medians
bootMedians <- vector(length= B)

# Repeating the resampling step (these draws overwrite the ones above)
for (b in 1:B) {
  boots[, b] <- observed[sample(1:n, size = n, replace = TRUE)]
}

# Re-instantiating the vector for bootstrap medians
bootMedians <- vector(length = B)

# Calculating the median of each bootstrap sample
for (b in 1:B) {
  bootMedians[b] <- median(boots[, b])
}

# Nonparametric estimate of the SE of the sample median
SEestimate <- sd (bootMedians)
SEestimate
## [1] 1.83859
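For comparison, a nonparametric bootstrap can resample from the whole observed vector rather than from a fixed block of rows. The sketch below is a hypothetical helper (boot_se_median and the toy values are not from the project) that draws samples of the same size as the data and returns the standard deviation of the resampled medians.

```r
# Hypothetical helper: nonparametric bootstrap SE of the sample median
boot_se_median <- function(x, B = 2000, seed = 1989) {
  set.seed(seed)
  meds <- replicate(B, median(sample(x, size = length(x), replace = TRUE)))
  sd(meds)
}

# Illustrative winning margins in wickets
toy_wickets <- c(0, 0, 4, 6, 7, 0, 5, 8, 0, 3, 6, 9, 0, 4, 7)

se <- boot_se_median(toy_wickets)
se
```

Applied to the real data, x would be matches$win_by_wickets, and the estimate would reflect all 756 matches.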

Parametric bootstrap: estimated standard error of the sample median

# Number of bootstrap samples
B <- 10000
#Instantiating matrix for bootstrap samples
paramBoots <- matrix(NA, nrow = n, ncol = B)
XBar <- mean(observed)
s <- sd(observed)

# Simulating a normal set of n values, B times

for(b in 1:B){
  paramBoots[, b] <- rnorm(n = n, mean = XBar, sd = s)
}

# Instantiating vector for bootstrap medians
bootParamMedians <- vector(length = B)

#Calculating median for each simulated data set
for(b in 1:B) {
bootParamMedians[b] <- median(paramBoots[, b])
}


# Parametric estimate of the SE of the sample median
SEparamEstimate <- sd(bootParamMedians)
SEparamEstimate
## [1] 0.7585229

Reference

Some of the code was adapted from class activities and modified to fit the data analysis in our project.